Project Name: Advanced House Price Prediction
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
pd.set_option('display.max_columns',None)
ds=pd.read_csv('train.csv')
print(ds.shape)
(1460, 81)
ds.head()
| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | LandSlope | Neighborhood | Condition1 | Condition2 | BldgType | HouseStyle | OverallQual | OverallCond | YearBuilt | YearRemodAdd | RoofStyle | RoofMatl | Exterior1st | Exterior2nd | MasVnrType | MasVnrArea | ExterQual | ExterCond | Foundation | BsmtQual | BsmtCond | BsmtExposure | BsmtFinType1 | BsmtFinSF1 | BsmtFinType2 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | Heating | HeatingQC | CentralAir | Electrical | 1stFlrSF | 2ndFlrSF | LowQualFinSF | GrLivArea | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | BedroomAbvGr | KitchenAbvGr | KitchenQual | TotRmsAbvGrd | Functional | Fireplaces | FireplaceQu | GarageType | GarageYrBlt | GarageFinish | GarageCars | GarageArea | GarageQual | GarageCond | PavedDrive | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | Inside | Gtl | CollgCr | Norm | Norm | 1Fam | 2Story | 7 | 5 | 2003 | 2003 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 196.0 | Gd | TA | PConc | Gd | TA | No | GLQ | 706 | Unf | 0 | 150 | 856 | GasA | Ex | Y | SBrkr | 856 | 854 | 0 | 1710 | 1 | 0 | 2 | 1 | 3 | 1 | Gd | 8 | Typ | 0 | NaN | Attchd | 2003.0 | RFn | 2 | 548 | TA | TA | Y | 0 | 61 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | FR2 | Gtl | Veenker | Feedr | Norm | 1Fam | 1Story | 6 | 8 | 1976 | 1976 | Gable | CompShg | MetalSd | MetalSd | NaN | 0.0 | TA | TA | CBlock | Gd | TA | Gd | ALQ | 978 | Unf | 0 | 284 | 1262 | GasA | Ex | Y | SBrkr | 1262 | 0 | 0 | 1262 | 0 | 1 | 2 | 0 | 3 | 1 | TA | 6 | Typ | 1 | TA | Attchd | 1976.0 | RFn | 2 | 460 | TA | TA | Y | 298 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | Inside | Gtl | CollgCr | Norm | Norm | 1Fam | 2Story | 7 | 5 | 2001 | 2002 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 162.0 | Gd | TA | PConc | Gd | TA | Mn | GLQ | 486 | Unf | 0 | 434 | 920 | GasA | Ex | Y | SBrkr | 920 | 866 | 0 | 1786 | 1 | 0 | 2 | 1 | 3 | 1 | Gd | 6 | Typ | 1 | TA | Attchd | 2001.0 | RFn | 2 | 608 | TA | TA | Y | 0 | 42 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | Corner | Gtl | Crawfor | Norm | Norm | 1Fam | 2Story | 7 | 5 | 1915 | 1970 | Gable | CompShg | Wd Sdng | Wd Shng | NaN | 0.0 | TA | TA | BrkTil | TA | Gd | No | ALQ | 216 | Unf | 0 | 540 | 756 | GasA | Gd | Y | SBrkr | 961 | 756 | 0 | 1717 | 1 | 0 | 1 | 0 | 3 | 1 | Gd | 7 | Typ | 1 | Gd | Detchd | 1998.0 | Unf | 3 | 642 | TA | TA | Y | 0 | 35 | 272 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | FR2 | Gtl | NoRidge | Norm | Norm | 1Fam | 2Story | 8 | 5 | 2000 | 2000 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 350.0 | Gd | TA | PConc | Gd | TA | Av | GLQ | 655 | Unf | 0 | 490 | 1145 | GasA | Ex | Y | SBrkr | 1145 | 1053 | 0 | 2198 | 1 | 0 | 2 | 1 | 4 | 1 | Gd | 9 | Typ | 1 | TA | Attchd | 2000.0 | RFn | 3 | 836 | TA | TA | Y | 192 | 84 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |
In Data Analysis we will analyze the data to find out:
- Missing values
- All the numerical variables
- Distribution of the numerical variables
- Categorical variables
- Cardinality of categorical variables
- Outliers
- Relationship between independent and dependent features
Missing Values
# Check the percentage of NaN values present in each feature
# Step 1: list the features that have more than one missing value
features_with_na=[feature for feature in ds.columns if ds[feature].isnull().sum()>1]
# Step 2: print each feature name and its percentage of missing values
for feature in features_with_na:
    print(feature, np.round(ds[feature].isnull().mean()*100, 2), '% missing values')
LotFrontage 17.74 % missing values
Alley 93.77 % missing values
MasVnrType 59.73 % missing values
MasVnrArea 0.55 % missing values
BsmtQual 2.53 % missing values
BsmtCond 2.53 % missing values
BsmtExposure 2.6 % missing values
BsmtFinType1 2.53 % missing values
BsmtFinType2 2.6 % missing values
FireplaceQu 47.26 % missing values
GarageType 5.55 % missing values
GarageYrBlt 5.55 % missing values
GarageFinish 5.55 % missing values
GarageQual 5.55 % missing values
GarageCond 5.55 % missing values
PoolQC 99.52 % missing values
Fence 80.75 % missing values
MiscFeature 96.3 % missing values
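The same per-column summary can be computed without an explicit loop. The following is a minimal sketch on a small made-up frame: the `toy` DataFrame and the helper name `missing_percentage` are illustrative assumptions, not part of the notebook. It reports values as percentages (0 to 100) and, unlike the loop above, keeps any column with at least one missing value.

```python
import numpy as np
import pandas as pd


def missing_percentage(df: pd.DataFrame) -> pd.Series:
    """Percentage of missing values per column, for columns with any NaN."""
    pct = df.isnull().mean() * 100           # fraction of NaN per column, scaled to percent
    pct = pct[pct > 0]                       # drop columns without missing values
    return pct.sort_values(ascending=False).round(2)


# Hypothetical toy data standing in for train.csv
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "Alley": [np.nan, np.nan, np.nan, "Grvl"],
    "LotArea": [8450, 9600, 11250, 9550],
})

missing = missing_percentage(toy)
print(missing)
```

On the real data, `missing_percentage(ds)` would give one sorted Series that is easier to scan than printed lines.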
Since there are many missing values, we need to find the relationship between missing values and sale price
for feature in features_with_na:
    data=ds.copy()
    # Make a variable that is 1 if the observation is missing and 0 otherwise
    data[feature]=np.where(data[feature].isnull(),1,0)
    # Compare the median SalePrice where the information is present (0) or missing (1)
    # First bar (0, present) = red, second bar (1, missing) = skyblue
    data.groupby(feature)['SalePrice'].median().plot.bar(color=['red','skyblue'])
    plt.title(feature)
    plt.show()
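The comparison drawn by the bar charts can also be tabulated, which makes the size of the gap easier to read. This is only a sketch on invented numbers: the `toy` frame and the helper name `median_price_by_missingness` are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def median_price_by_missingness(df, feature, target="SalePrice"):
    """Median of `target` for rows where `feature` is present (0) vs missing (1)."""
    indicator = df[feature].isnull().astype(int)   # 1 = missing, 0 = present
    return df[target].groupby(indicator).median()


# Hypothetical toy data, not taken from train.csv
toy = pd.DataFrame({
    "Fence": [np.nan, "MnPrv", np.nan, "GdWo", np.nan],
    "SalePrice": [200000, 150000, 220000, 140000, 180000],
})

result = median_price_by_missingness(toy, "Fence")
print(result)
```

Calling the helper for every entry of `features_with_na` on `ds` would produce the numbers behind each chart.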
Here the relationship between the missing values and the dependent variable is clearly visible, so we need to replace these NaN values with something meaningful, which we will do in the Feature Engineering section.
Some of the features in the dataset, such as Id, are not required.
print("Id of Houses {}".format(len(ds.Id)))
Id of Houses 1460
Numerical Variables
# List of numerical variables (every column whose dtype is not object, 'O')
numerical_features=[feature for feature in ds.columns if ds[feature].dtypes != 'O']
print('Number of numerical variables: ', len(numerical_features))
# Visualise the numerical variables
ds[numerical_features].head()
Number of numerical variables: 38
| | Id | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | 1stFlrSF | 2ndFlrSF | LowQualFinSF | GrLivArea | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | BedroomAbvGr | KitchenAbvGr | TotRmsAbvGrd | Fireplaces | GarageYrBlt | GarageCars | GarageArea | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | 65.0 | 8450 | 7 | 5 | 2003 | 2003 | 196.0 | 706 | 0 | 150 | 856 | 856 | 854 | 0 | 1710 | 1 | 0 | 2 | 1 | 3 | 1 | 8 | 0 | 2003.0 | 2 | 548 | 0 | 61 | 0 | 0 | 0 | 0 | 0 | 2 | 2008 | 208500 |
| 1 | 2 | 20 | 80.0 | 9600 | 6 | 8 | 1976 | 1976 | 0.0 | 978 | 0 | 284 | 1262 | 1262 | 0 | 0 | 1262 | 0 | 1 | 2 | 0 | 3 | 1 | 6 | 1 | 1976.0 | 2 | 460 | 298 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 2007 | 181500 |
| 2 | 3 | 60 | 68.0 | 11250 | 7 | 5 | 2001 | 2002 | 162.0 | 486 | 0 | 434 | 920 | 920 | 866 | 0 | 1786 | 1 | 0 | 2 | 1 | 3 | 1 | 6 | 1 | 2001.0 | 2 | 608 | 0 | 42 | 0 | 0 | 0 | 0 | 0 | 9 | 2008 | 223500 |
| 3 | 4 | 70 | 60.0 | 9550 | 7 | 5 | 1915 | 1970 | 0.0 | 216 | 0 | 540 | 756 | 961 | 756 | 0 | 1717 | 1 | 0 | 1 | 0 | 3 | 1 | 7 | 1 | 1998.0 | 3 | 642 | 0 | 35 | 272 | 0 | 0 | 0 | 0 | 2 | 2006 | 140000 |
| 4 | 5 | 60 | 84.0 | 14260 | 8 | 5 | 2000 | 2000 | 350.0 | 655 | 0 | 490 | 1145 | 1145 | 1053 | 0 | 2198 | 1 | 0 | 2 | 1 | 4 | 1 | 9 | 1 | 2000.0 | 3 | 836 | 192 | 84 | 0 | 0 | 0 | 0 | 0 | 12 | 2008 | 250000 |
Temporal Variables (e.g., Datetime Variables)
The dataset has 4 year variables. From datetime variables like these we can extract information such as the number of years or the number of days between two events. One example in this specific scenario is the difference in years between the year the house was built and the year the house was sold. We will be performing this analysis in Feature Engineering, which is the next video.
# List of variables that contain year information
year_feature=[feature for feature in numerical_features if 'Yr' in feature or 'Year' in feature]
year_feature
['YearBuilt', 'YearRemodAdd', 'GarageYrBlt', 'YrSold']
# let's explore the content of these year variables
for feature in year_feature:
    print(feature,ds[feature].unique())
YearBuilt [2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 1965 2005 1962 2006 1960 1929 1970 1967 1958 1930 2002 1968 2007 1951 1957 1927 1920 1966 1959 1994 1954 1953 1955 1983 1975 1997 1934 1963 1981 1964 1999 1972 1921 1945 1982 1998 1956 1948 1910 1995 1991 2009 1950 1961 1977 1985 1979 1885 1919 1990 1969 1935 1988 1971 1952 1936 1923 1924 1984 1926 1940 1941 1987 1986 2008 1908 1892 1916 1932 1918 1912 1947 1925 1900 1980 1989 1992 1949 1880 1928 1978 1922 1996 2010 1946 1913 1937 1942 1938 1974 1893 1914 1906 1890 1898 1904 1882 1875 1911 1917 1872 1905]
YearRemodAdd [2003 1976 2002 1970 2000 1995 2005 1973 1950 1965 2006 1962 2007 1960 2001 1967 2004 2008 1997 1959 1990 1955 1983 1980 1966 1963 1987 1964 1972 1996 1998 1989 1953 1956 1968 1981 1992 2009 1982 1961 1993 1999 1985 1979 1977 1969 1958 1991 1971 1952 1975 2010 1984 1986 1994 1988 1954 1957 1951 1978 1974]
GarageYrBlt [2003. 1976. 2001. 1998. 2000. 1993. 2004. 1973. 1931. 1939. 1965. 2005. 1962. 2006. 1960. 1991. 1970. 1967. 1958. 1930. 2002. 1968. 2007. 2008. 1957. 1920. 1966. 1959. 1995. 1954. 1953. nan 1983. 1977. 1997. 1985. 1963. 1981. 1964. 1999. 1935. 1990. 1945. 1987. 1989. 1915. 1956. 1948. 1974. 2009. 1950. 1961. 1921. 1900. 1979. 1951. 1969. 1936. 1975. 1971. 1923. 1984. 1926. 1955. 1986. 1988. 1916. 1932. 1972. 1918. 1980. 1924. 1996. 1940. 1949. 1994. 1910. 1978. 1982. 1992. 1925. 1941. 2010. 1927. 1947. 1937. 1942. 1938. 1952. 1928. 1922. 1934. 1906. 1914. 1946. 1908. 1929. 1933.]
YrSold [2008 2007 2006 2009 2010]
## Let's analyze the temporal datetime variables
## We will check whether there is a relation between the year the house was sold and the sale price
ds.groupby('YrSold')['SalePrice'].median().plot()
plt.xlabel('Year Sold')
plt.ylabel('Median House Price')
plt.title('House Price vs YearSold')
Text(0.5, 1.0, 'House Price vs YearSold')
year_feature
['YearBuilt', 'YearRemodAdd', 'GarageYrBlt', 'YrSold']
## Here we will compare the difference between each year feature and YrSold with SalePrice
for feature in year_feature:
    if feature!='YrSold':
        data=ds.copy()
        ## Capture the difference between the year variable and the year the house was sold
        data[feature]=data['YrSold']-data[feature]
        plt.scatter(data[feature],data['SalePrice'])
        plt.xlabel(feature)
        plt.ylabel('SalePrice')
        plt.show()
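The year differences plotted above can also be kept as separate columns so the original year values are not overwritten. The sketch below assumes a hypothetical helper `add_age_features` and an `_age` column suffix; neither appears in the notebook, and the three rows are copied from the first, second and fourth rows of the `head()` output only for illustration.

```python
import pandas as pd


def add_age_features(df, year_columns, sold_column="YrSold"):
    """Add '<col>_age' = sold year minus <col> for each year column."""
    out = df.copy()
    for col in year_columns:
        out[col + "_age"] = out[sold_column] - out[col]
    return out


toy = pd.DataFrame({
    "YearBuilt": [2003, 1976, 1915],
    "YearRemodAdd": [2003, 1976, 1970],
    "YrSold": [2008, 2007, 2006],
    "SalePrice": [208500, 181500, 140000],
})

aged = add_age_features(toy, ["YearBuilt", "YearRemodAdd"])
print(aged[["YearBuilt_age", "YearRemodAdd_age", "SalePrice"]])

# A negative correlation means older houses tend to sell for less in this sample
print(aged["YearBuilt_age"].corr(aged["SalePrice"]))
```

With only three rows the correlation is just a sanity check of the direction, not an estimate for the full dataset.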
## Numerical variables are usually of 2 types:
## continuous variables and discrete variables
## A discrete variable here is a numeric (non-object) column with fewer than 25 distinct values that is not a year feature
discrete_feature=[feature for feature in numerical_features if ds[feature].dtype != 'O' and len(ds[feature].unique())<25 and feature not in year_feature]
print("Discrete Variables Count: {}".format(len(discrete_feature)))
Discrete Variables Count: 17
discrete_feature
['MSSubClass', 'OverallQual', 'OverallCond', 'LowQualFinSF', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageCars', '3SsnPorch', 'PoolArea', 'MiscVal', 'MoSold']
ds[discrete_feature].head()
| | MSSubClass | OverallQual | OverallCond | LowQualFinSF | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | BedroomAbvGr | KitchenAbvGr | TotRmsAbvGrd | Fireplaces | GarageCars | 3SsnPorch | PoolArea | MiscVal | MoSold |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 60 | 7 | 5 | 0 | 1 | 0 | 2 | 1 | 3 | 1 | 8 | 0 | 2 | 0 | 0 | 0 | 2 |
| 1 | 20 | 6 | 8 | 0 | 0 | 1 | 2 | 0 | 3 | 1 | 6 | 1 | 2 | 0 | 0 | 0 | 5 |
| 2 | 60 | 7 | 5 | 0 | 1 | 0 | 2 | 1 | 3 | 1 | 6 | 1 | 2 | 0 | 0 | 0 | 9 |
| 3 | 70 | 7 | 5 | 0 | 1 | 0 | 1 | 0 | 3 | 1 | 7 | 1 | 3 | 0 | 0 | 0 | 2 |
| 4 | 60 | 8 | 5 | 0 | 1 | 0 | 2 | 1 | 4 | 1 | 9 | 1 | 3 | 0 | 0 | 0 | 12 |
# Let's find the relationship between the discrete features and SalePrice
for feature in discrete_feature:
    data=ds.copy()
    data.groupby(feature)['SalePrice'].median().plot.bar(color=['red','blue','green','orange'])
    plt.xlabel(feature)
    plt.ylabel('SalePrice')
    plt.title(feature)
    plt.show()
## There is a relationship between the discrete variables and SalePrice
Continuous Variables
# Continuous features: numeric (non-object) columns that are neither discrete, nor year features, nor Id
continous_feature=[feature for feature in numerical_features if ds[feature].dtype != 'O' and feature not in discrete_feature+year_feature+['Id']]
print("Continuous feature count {}".format(len(continous_feature)))
Continuous feature count 16
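The discrete/continuous split used above can be wrapped in one helper that first restricts itself to numeric columns. This is a sketch under stated assumptions: the function name `split_numeric_features`, its parameters, and the `toy` frame are hypothetical; the threshold of 25 distinct values is the one used in the notebook.

```python
import pandas as pd


def split_numeric_features(df, max_discrete_levels=25, exclude=()):
    """Split numeric columns into discrete (< max_discrete_levels distinct values) and continuous."""
    numeric_columns = df.select_dtypes(include="number").columns
    discrete, continuous = [], []
    for col in numeric_columns:
        if col in exclude:
            continue
        if df[col].nunique(dropna=False) < max_discrete_levels:
            discrete.append(col)
        else:
            continuous.append(col)
    return discrete, continuous


# Hypothetical toy data with one id, one discrete, one continuous and one text column
toy = pd.DataFrame({
    "Id": list(range(1, 31)),
    "OverallQual": [i % 10 + 1 for i in range(30)],
    "LotArea": [8000 + 137 * i for i in range(30)],
    "MSZoning": ["RL", "RM", "FV"] * 10,
})

discrete, continuous = split_numeric_features(toy, exclude=["Id"])
print(discrete, continuous)
```

On the real data one would likely pass `exclude=year_feature + ['Id']` to mirror the notebook's filters.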
# Let's analyze the continuous variables by creating histograms to understand their distributions
for feature in continous_feature:
    data=ds.copy()
    data[feature].hist(bins=25)
    plt.xlabel(feature)
    plt.ylabel("Count")
    plt.title(feature)
    plt.show()
Exploratory Data Analysis
## We will be using a logarithmic transformation
for feature in continous_feature:
    data=ds.copy()
    # Skip if the column is not numeric
    if data[feature].dtype == 'object':
        continue
    # Skip if the column contains 0 values, because log(0) is undefined
    if 0 in data[feature].values:
        continue
    else:
        # Take both logs from the untransformed columns so that SalePrice
        # is not logged twice when feature is 'SalePrice'
        log_feature=np.log(data[feature])
        log_price=np.log(data['SalePrice'])
        plt.scatter(log_feature,log_price)
        plt.xlabel(feature)
        plt.ylabel('SalePrice')
        plt.title(feature)
        plt.show()
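The loop above skips every feature that contains a zero, since `np.log(0)` is undefined. One common alternative, which the notebook does not use and which is shown here only as a sketch, is `np.log1p`, which computes log(1 + x) and is therefore defined at zero. The helper name `log1p_transform` and the `toy` frame are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def log1p_transform(df, columns):
    """Return a copy of df with log(1 + x) applied to the given non-negative columns."""
    out = df.copy()
    for col in columns:
        if (out[col] < 0).any():
            raise ValueError(f"{col} contains negative values")
        out[col] = np.log1p(out[col])
    return out


# Hypothetical toy data; WoodDeckSF contains zeros
toy = pd.DataFrame({
    "GrLivArea": [1710, 1262, 1786],
    "WoodDeckSF": [0, 298, 0],
    "SalePrice": [208500, 181500, 223500],
})

logged = log1p_transform(toy, ["GrLivArea", "WoodDeckSF", "SalePrice"])
print(logged.round(3))
```

The trade-off is that zero-heavy columns such as `WoodDeckSF` stay in the analysis, but their many zeros still form a spike at 0 after the transform.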
Outliers
for feature in continous_feature:
    data=ds.copy()
    # Skip if the column is not numeric
    if data[feature].dtype == 'object':
        continue
    # Skip if the column contains 0 values, because log(0) is undefined
    if 0 in data[feature].values:
        continue
    else:
        data[feature]=np.log(data[feature])
        data.boxplot(column=feature)
        plt.ylabel(feature)
        plt.title(feature)
        plt.show()
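Box plots mark points beyond 1.5 times the interquartile range from the quartiles by default, so the outliers they draw can also be counted numerically. The following is a sketch of that rule on made-up values; the helper name `iqr_outlier_mask` and the `toy` series are assumptions, and on the real data it would be applied to the log-transformed column to match the plots.

```python
import pandas as pd


def iqr_outlier_mask(series, k=1.5):
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Hypothetical values with one obvious outlier
toy = pd.Series([100, 110, 120, 130, 140, 150, 1000], name="LotArea")

mask = iqr_outlier_mask(toy)
print(toy[mask])
print("outliers:", int(mask.sum()))
```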
Categorical Variables
categorical_features=[feature for feature in ds.columns if ds[feature].dtype=='O']
categorical_features
['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual', 'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'PavedDrive', 'PoolQC', 'Fence', 'MiscFeature', 'SaleType', 'SaleCondition']
ds[categorical_features].head()
| | MSZoning | Street | Alley | LotShape | LandContour | Utilities | LotConfig | LandSlope | Neighborhood | Condition1 | Condition2 | BldgType | HouseStyle | RoofStyle | RoofMatl | Exterior1st | Exterior2nd | MasVnrType | ExterQual | ExterCond | Foundation | BsmtQual | BsmtCond | BsmtExposure | BsmtFinType1 | BsmtFinType2 | Heating | HeatingQC | CentralAir | Electrical | KitchenQual | Functional | FireplaceQu | GarageType | GarageFinish | GarageQual | GarageCond | PavedDrive | PoolQC | Fence | MiscFeature | SaleType | SaleCondition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | RL | Pave | NaN | Reg | Lvl | AllPub | Inside | Gtl | CollgCr | Norm | Norm | 1Fam | 2Story | Gable | CompShg | VinylSd | VinylSd | BrkFace | Gd | TA | PConc | Gd | TA | No | GLQ | Unf | GasA | Ex | Y | SBrkr | Gd | Typ | NaN | Attchd | RFn | TA | TA | Y | NaN | NaN | NaN | WD | Normal |
| 1 | RL | Pave | NaN | Reg | Lvl | AllPub | FR2 | Gtl | Veenker | Feedr | Norm | 1Fam | 1Story | Gable | CompShg | MetalSd | MetalSd | NaN | TA | TA | CBlock | Gd | TA | Gd | ALQ | Unf | GasA | Ex | Y | SBrkr | TA | Typ | TA | Attchd | RFn | TA | TA | Y | NaN | NaN | NaN | WD | Normal |
| 2 | RL | Pave | NaN | IR1 | Lvl | AllPub | Inside | Gtl | CollgCr | Norm | Norm | 1Fam | 2Story | Gable | CompShg | VinylSd | VinylSd | BrkFace | Gd | TA | PConc | Gd | TA | Mn | GLQ | Unf | GasA | Ex | Y | SBrkr | Gd | Typ | TA | Attchd | RFn | TA | TA | Y | NaN | NaN | NaN | WD | Normal |
| 3 | RL | Pave | NaN | IR1 | Lvl | AllPub | Corner | Gtl | Crawfor | Norm | Norm | 1Fam | 2Story | Gable | CompShg | Wd Sdng | Wd Shng | NaN | TA | TA | BrkTil | TA | Gd | No | ALQ | Unf | GasA | Gd | Y | SBrkr | Gd | Typ | Gd | Detchd | Unf | TA | TA | Y | NaN | NaN | NaN | WD | Abnorml |
| 4 | RL | Pave | NaN | IR1 | Lvl | AllPub | FR2 | Gtl | NoRidge | Norm | Norm | 1Fam | 2Story | Gable | CompShg | VinylSd | VinylSd | BrkFace | Gd | TA | PConc | Gd | TA | Av | GLQ | Unf | GasA | Ex | Y | SBrkr | Gd | Typ | TA | Attchd | RFn | TA | TA | Y | NaN | NaN | NaN | WD | Normal |
for feature in categorical_features:
    print('The feature is {} and number of categories are {}'.format(feature,len(ds[feature].unique())))
The feature is MSZoning and number of categories are 5
The feature is Street and number of categories are 2
The feature is Alley and number of categories are 3
The feature is LotShape and number of categories are 4
The feature is LandContour and number of categories are 4
The feature is Utilities and number of categories are 2
The feature is LotConfig and number of categories are 5
The feature is LandSlope and number of categories are 3
The feature is Neighborhood and number of categories are 25
The feature is Condition1 and number of categories are 9
The feature is Condition2 and number of categories are 8
The feature is BldgType and number of categories are 5
The feature is HouseStyle and number of categories are 8
The feature is RoofStyle and number of categories are 6
The feature is RoofMatl and number of categories are 8
The feature is Exterior1st and number of categories are 15
The feature is Exterior2nd and number of categories are 16
The feature is MasVnrType and number of categories are 4
The feature is ExterQual and number of categories are 4
The feature is ExterCond and number of categories are 5
The feature is Foundation and number of categories are 6
The feature is BsmtQual and number of categories are 5
The feature is BsmtCond and number of categories are 5
The feature is BsmtExposure and number of categories are 5
The feature is BsmtFinType1 and number of categories are 7
The feature is BsmtFinType2 and number of categories are 7
The feature is Heating and number of categories are 6
The feature is HeatingQC and number of categories are 5
The feature is CentralAir and number of categories are 2
The feature is Electrical and number of categories are 6
The feature is KitchenQual and number of categories are 4
The feature is Functional and number of categories are 7
The feature is FireplaceQu and number of categories are 6
The feature is GarageType and number of categories are 7
The feature is GarageFinish and number of categories are 4
The feature is GarageQual and number of categories are 6
The feature is GarageCond and number of categories are 6
The feature is PavedDrive and number of categories are 3
The feature is PoolQC and number of categories are 4
The feature is Fence and number of categories are 5
The feature is MiscFeature and number of categories are 5
The feature is SaleType and number of categories are 9
The feature is SaleCondition and number of categories are 6
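Note that `len(ds[feature].unique())` counts NaN as its own category whenever a column has missing values, so some of the counts above include a "missing" level. A minimal sketch of how the two counts differ, using pandas' `nunique` with and without `dropna`; the `toy` frame and the column labels `levels` and `levels_incl_nan` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: Alley has missing values, Street does not
toy = pd.DataFrame({
    "Alley": [np.nan, "Grvl", "Pave", np.nan],
    "Street": ["Pave", "Pave", "Grvl", "Pave"],
})

cardinality = pd.DataFrame({
    "levels": toy.nunique(),                      # distinct non-missing categories
    "levels_incl_nan": toy.nunique(dropna=False), # same count as len(unique())
})
print(cardinality)
```

Which count is appropriate depends on whether missing values will later be encoded as a category of their own.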
## Find out the relationship between the categorical variables and the dependent feature, i.e. SalePrice
for feature in categorical_features:
    data=ds.copy()
    data.groupby(feature)['SalePrice'].median().plot.bar()
    plt.xlabel(feature)
    plt.ylabel('SalePrice')
    plt.title(feature)
    plt.show()